Going to try replicating the best classifier we've managed so far using only a single feature, then including more features until the performance changes.
In [1]:
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
In [2]:
%matplotlib inline
plt.rcParams['figure.figsize'] = 6, 4.5
plt.rcParams['axes.grid'] = True
plt.gray()
In [3]:
cd ..
In [4]:
import train
import json
import imp
In [5]:
settings = json.load(open('SETTINGS.json', 'r'))
In [6]:
settings['FEATURES']
Out[6]:
In [7]:
data = train.get_data(settings['FEATURES'])
In [9]:
!free -m
In [9]:
# getting a set of the subjects involved
subjects = set(list(data.values())[0].keys())
print(subjects)
In [10]:
import sklearn.preprocessing
import sklearn.pipeline
import sklearn.ensemble
import sklearn.cross_validation
import sklearn.metrics
from train import utils
We want to do some simple feature selection: even with a massive amount of RAM available there's no point in using features that are obviously useless. The first suggestion for this is a variance threshold, which removes features with low variance.
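As a minimal sketch of what that would do (a synthetic matrix here, not our data; the real thing appears further down):

import numpy as np
import sklearn.feature_selection

# synthetic matrix: the middle column is constant, so it has zero variance
Xdemo = np.array([[1.0, 5.0, 0.1],
                  [2.0, 5.0, 0.2],
                  [3.0, 5.0, 0.3]])
# the default threshold of 0.0 removes only zero-variance (constant) features
thresh = sklearn.feature_selection.VarianceThreshold()
print(thresh.fit_transform(Xdemo).shape)  # (3, 2): the constant column is dropped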
In [13]:
X,y = utils.build_training(list(subjects)[0],list(data.keys()),data)
In [27]:
h=plt.hist(np.log10(np.var(X,axis=0)))
However, I don't really like this much, as low variance doesn't mean there's no information there. After all, variance scales with any multiplicative constant applied to the feature.
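A quick illustration of that point (arbitrary numbers, just to make the scaling obvious):

x = np.random.randn(100)            # an arbitrary feature
print(np.var(x), np.var(1000 * x))  # the second variance is ~10^6 times the first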
A better approach is scikit-learn's SelectKBest, which can score features using $\chi^2$ or ANOVA F-values. We can't use $\chi^2$ here as it demands non-negative features. Selecting the 50 best by F-value and attempting to plot them in 2D with PCA:
In [28]:
import sklearn.feature_selection
In [30]:
Xbest = sklearn.feature_selection.SelectKBest(sklearn.feature_selection.f_classif, k=50).fit_transform(X,y)
In [31]:
import sklearn.decomposition
In [40]:
pca = sklearn.decomposition.PCA(n_components=2)
scaler = sklearn.preprocessing.StandardScaler()
twodX = pca.fit_transform(scaler.fit_transform(Xbest))
plt.scatter(twodX[:,0][y==1],twodX[:,1][y==1],c='blue')
plt.scatter(twodX[:,0][y==0],twodX[:,1][y==0],c='red')
Out[40]:
Looking good, now doing the same with the magical t-SNE:
In [41]:
import sklearn.manifold
In [42]:
tsne = sklearn.manifold.TSNE()
twodX = tsne.fit_transform(scaler.fit_transform(Xbest))
plt.scatter(twodX[:,0][y==1],twodX[:,1][y==1],c='blue')
plt.scatter(twodX[:,0][y==0],twodX[:,1][y==0],c='red')
Out[42]:
Also looking good.
So, all we need to do is add the selection step to our pipeline and turn up n_estimators, and we should get a better prediction.
In [44]:
selection = sklearn.feature_selection.SelectKBest(sklearn.feature_selection.f_classif,k=3000)
scaler = sklearn.preprocessing.StandardScaler()
forest = sklearn.ensemble.RandomForestClassifier()
model = sklearn.pipeline.Pipeline([('sel',selection),('scl',scaler),('clf',forest)])
Testing this model on this single subject:
In [45]:
def subjpredictions(subject,model,data):
    X,y = utils.build_training(subject,list(data.keys()),data)
    cv = sklearn.cross_validation.StratifiedShuffleSplit(y)
    predictions = []
    labels = []
    allweights = []
    for train,test in cv:
        # calculate weights
        weight = len(y[train])/sum(y[train])
        weights = np.array([weight if i == 1 else 1 for i in y[train]])
        model.fit(X[train],y[train],clf__sample_weight=weights)
        predictions.append(model.predict_proba(X[test]))
        weight = len(y[test])/sum(y[test])
        weights = np.array([weight if i == 1 else 1 for i in y[test]])
        allweights.append(weights)
        labels.append(y[test])
    predictions = np.vstack(predictions)[:,1]
    labels = np.hstack(labels)
    weights = np.hstack(allweights)
    return predictions,labels,weights
In [47]:
p,l,w = subjpredictions(list(subjects)[0],model,data)
In [48]:
sklearn.metrics.roc_auc_score(l,p)
Out[48]:
In [49]:
fpr,tpr,thresholds = sklearn.metrics.roc_curve(l,p)
plt.plot(fpr,tpr)
Out[49]:
It certainly works a bit better than the classifier I was working with before. What if we increase the number of estimators to deal with the much larger number of features?
In [51]:
model.set_params(clf__n_estimators=5000)
Out[51]:
In [53]:
%%time
p,l,w = subjpredictions(list(subjects)[0],model,data)
In [54]:
sklearn.metrics.roc_auc_score(l,p)
Out[54]:
In [55]:
fpr,tpr,thresholds = sklearn.metrics.roc_curve(l,p)
plt.plot(fpr,tpr)
Out[55]:
Actually, it looks like I could just run this in the notebook and it'd probably be fine. I'll write the script after doing this.
In [56]:
features = list(data.keys())
In [57]:
%%time
predictiondict = {}
for subj in subjects:
    # training step
    X,y = utils.build_training(subj,features,data)
    # weights
    weight = len(y)/sum(y)
    weights = np.array([weight if i == 1 else 1 for i in y])
    model.fit(X,y,clf__sample_weight=weights)
    # prediction step
    X,segments = utils.build_test(subj,features,data)
    predictions = model.predict_proba(X)
    for segment,prediction in zip(segments,predictions):
        predictiondict[segment] = prediction
In [58]:
import csv
In [59]:
with open("output/protosubmission.csv","w") as f:
    c = csv.writer(f)
    c.writerow(['clip','preictal'])
    for seg in predictiondict.keys():
        c.writerow([seg,"%s"%predictiondict[seg][-1]])
Submitted this and got approximately 0.4, worse than predicting all zeros. How did that happen? Possibly something to do with the warnings above?
Checking if there's anything obviously weird with the output file:
In [60]:
!head output/protosubmission.csv
Nope, looks ok.
There are three warnings above, so this problem is only occurring on three of the subjects. Could run each subject individually and try to find one where the ROC completely falls apart.
In [ ]:
%%time
for s in subjects:
    p,l,w = subjpredictions(s,model,data)
    print("subject %s"%s, sklearn.metrics.roc_auc_score(l,p,sample_weight=w))
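My suspicion (an assumption at this point, not something confirmed above) is that some features are constant within a subject, which makes f_classif produce NaN F-scores and confuses SelectKBest. A quick check along these lines might be:

# hypothetical check: count zero-variance features for each subject
for s in subjects:
    X, y = utils.build_training(s, features, data)
    print("subject %s: %d constant features" % (s, int(np.sum(np.var(X, axis=0) == 0))))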
The simple solution is apparently to run a variance threshold before the feature selection (like it says on the wiki, really).
In [62]:
thresh = sklearn.feature_selection.VarianceThreshold()
selection = sklearn.feature_selection.SelectKBest(sklearn.feature_selection.f_classif,k=3000)
scaler = sklearn.preprocessing.StandardScaler()
forest = sklearn.ensemble.RandomForestClassifier()
model = sklearn.pipeline.Pipeline([('thr',thresh),('sel',selection),('scl',scaler),('clf',forest)])
In [63]:
%%time
predictiondict = {}
for subj in subjects:
    # training step
    X,y = utils.build_training(subj,features,data)
    # weights
    weight = len(y)/sum(y)
    weights = np.array([weight if i == 1 else 1 for i in y])
    model.fit(X,y,clf__sample_weight=weights)
    # prediction step
    X,segments = utils.build_test(subj,features,data)
    predictions = model.predict_proba(X)
    for segment,prediction in zip(segments,predictions):
        predictiondict[segment] = prediction
In [64]:
with open("output/protosubmission.csv","w") as f:
    c = csv.writer(f)
    c.writerow(['clip','preictal'])
    for seg in predictiondict.keys():
        c.writerow([seg,"%s"%predictiondict[seg][-1]])